
Conversation

@wangxiyuan wangxiyuan (Collaborator) commented Sep 29, 2025

What this PR does / why we need it?

This PR adds initial DeepSeek V3.2 support on top of vLLM v0.11.0 (not yet released). We will complete the vLLM adaptation as soon as possible; this feature should be ready within the next 1-2 days.

Related doc: #3223.

Does this PR introduce any user-facing change?

Yes!

How was this patch tested?

CI passed. A doc on running DeepSeek will follow shortly.


👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@Yikun Yikun changed the title from "New model" to "Init v0.11.0 related code" Sep 29, 2025
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a new Sparse Flash Attention (SFA) mechanism, primarily for DeepSeek V2 models on Ascend hardware. The changes are extensive, adding new attention backends, modifying model definitions, and updating distributed communication logic to support the new KV cache structure for SFA. While the implementation adds significant new functionality, I have identified a critical issue in multi-stream metadata handling that could lead to incorrect attention calculations. Additionally, there are several instances of hardcoded configuration values that make the implementation brittle and prone to errors with different model variants. I've also flagged a few TODO comments in critical code paths that suggest the logic may be incomplete or unverified. Addressing these points is crucial for the stability and correctness of the new SFA feature.

Comment on lines +157 to +161
return model_input_split_v1_mla_attn(
    ms_split_config=ms_split_config,
    attn_metadata=self,
    _metadata_cls=AscendMLAMetadata,
)

critical

The split_metadata_for_multistream method is incorrectly using AscendMLAMetadata as the _metadata_cls when splitting metadata. Since this method is part of AscendSFAMetadata, it should be using its own class, AscendSFAMetadata, to ensure the correct metadata structure is propagated in multi-stream scenarios. Using the wrong metadata class can lead to incorrect attention calculations or runtime errors.

Suggested change
return model_input_split_v1_mla_attn(
    ms_split_config=ms_split_config,
    attn_metadata=self,
    _metadata_cls=AscendMLAMetadata,
)
return model_input_split_v1_mla_attn(
    ms_split_config=ms_split_config,
    attn_metadata=self,
    _metadata_cls=AscendSFAMetadata,
)

prefill_q_pe = torch_npu.npu_interleave_rope(
    prefill_q_pe, cos_q, sin_q)  # BNSD
prefill_q_pe = prefill_q_pe.squeeze(2)  # BSH
# q[..., self.qk_nope_head_dim:] = prefill_q_pe  # TODO:????

high

This line contains a TODO:???? comment, which indicates uncertainty about the implementation. This part of the code seems related to applying rotary position embeddings. Leaving such a TODO suggests the logic might be incomplete or incorrect, which could lead to erroneous attention results. This should be investigated and resolved before merging.

def mla_epilog(self,
               attn_output: torch.Tensor = None,
               absorb: bool = False):
    # TODO: need to check

high

A TODO: need to check comment is present in the mla_epilog function. This indicates that the logic in this function may not have been fully verified. Unchecked code, especially in critical paths like the attention epilogue, poses a risk and should be reviewed and confirmed to be correct before merging.

Comment on lines +366 to +369
# TODO(zzzzwwjj): wait transformers add these params
self.n_heads: int = 64 # 64
self.head_dim: int = 128 # 128
self.index_topk: int = 2048 # 2048

high

The parameters n_heads, head_dim, and index_topk for the indexer are hardcoded. This makes the implementation brittle and may cause silent errors or incorrect behavior if used with model variants that have different configurations. These values should be derived from the model config object rather than being hardcoded.
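A minimal sketch of that direction, assuming the HF config eventually exposes these fields (the attribute names index_n_heads, index_head_dim, and index_topk are assumptions, with today's hardcoded values kept only as fallbacks):

# Sketch only: pull indexer dimensions from the model config when present,
# falling back to the current DeepSeek V3.2 defaults.
self.n_heads: int = getattr(config, "index_n_heads", 64)
self.head_dim: int = getattr(config, "index_head_dim", 128)
self.index_topk: int = getattr(config, "index_topk", 2048)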

Comment on lines +169 to +178
def __init__(self,
             config,
             dim: int = 7168,
             n_heads: int = 64,
             head_dim: int = 128,
             index_topk: int = 2048,
             q_lora_rank: int = 1536,
             rope_head_dim: int = 64,
             quant_config: Optional[QuantizationConfig] = None,
             prefix: Optional[str] = ""):

high

The Indexer class __init__ method uses hardcoded default values for several dimensional parameters like dim, n_heads, head_dim, etc. This is risky as it might not match the actual model configuration passed during instantiation, leading to silent errors or incorrect behavior. These parameters should be passed explicitly from the model's configuration during initialization, and the defaults should be used cautiously.
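One possible call-site shape, assuming the surrounding module has a config object with hidden_size, q_lora_rank, and qk_rope_head_dim (the index_* attribute names are hypothetical):

# Sketch only: build the Indexer from config values rather than relying on
# the hardcoded defaults in the signature above.
self.indexer = Indexer(
    config,
    dim=config.hidden_size,
    n_heads=getattr(config, "index_n_heads", 64),
    head_dim=getattr(config, "index_head_dim", 128),
    index_topk=getattr(config, "index_topk", 2048),
    q_lora_rank=config.q_lora_rank,
    rope_head_dim=config.qk_rope_head_dim,
    quant_config=quant_config,
    prefix=f"{prefix}.indexer",
)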

Comment on lines +26 to +27
sfa_bytes = 128 * self.block_size * get_dtype_size(
    self.dtype) if self.use_sfa else 0

high

The value 128 is hardcoded in the sfa_bytes calculation. This value appears to correspond to the indexer's head dimension. Using magic numbers reduces code maintainability and can lead to bugs if the indexer's configuration changes. This value should be retrieved from a configuration source or defined as a named constant to improve clarity and robustness.
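A small sketch of the named-constant option (SFA_INDEXER_HEAD_DIM is a hypothetical name; the value is the one currently hardcoded):

# Sketch only: name the magic number so its origin (the indexer head_dim) is
# explicit; ideally it would be read from the model config instead.
SFA_INDEXER_HEAD_DIM = 128  # assumed to match the SFA indexer head_dim

sfa_bytes = (SFA_INDEXER_HEAD_DIM * self.block_size *
             get_dtype_size(self.dtype)) if self.use_sfa else 0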

Comment on lines +768 to +772
self.dim: int = config.hidden_size # 7168
# TODO(zzzzwwjj): wait transformers add these params
self.n_heads: int = 64 # 64
self.head_dim: int = 128 # 128
self.index_topk: int = 2048 # 2048

high

The parameters dim, n_heads, head_dim, and index_topk for the indexer are hardcoded. This makes the implementation brittle and may cause issues if used with model variants that have different configurations. These values should be derived from the model config object instead of being hardcoded to ensure the implementation is robust.

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a new Sparse Flash Attention (SFA) backend for Ascend NPUs, specifically for the DeepSeek-V2 model. The changes include the core SFA implementation, metadata builders, torchair integration, and patches to vLLM to select and configure this new backend. While the implementation is extensive, I've found several critical issues related to incorrect path selection between prefill and decode operations in mixed batches, which could lead to incorrect attention results. There are also some maintainability concerns with hardcoded values. Addressing these issues is crucial for the correctness and robustness of the new backend.

Comment on lines +845 to +872
def apply_attention_fusion(self, query_states, key_states, topk_indices,
                           attn_metadata: M):
    # repeat k/v heads if n_kv_heads < n_heads
    q_nope, q_pe = query_states
    k_nope, k_rope = key_states

    if attn_metadata.prefill is not None:

        prefill_metadata = attn_metadata.prefill

        slc_fa_fusion = torch.ops.custom.npu_sparse_flash_attention(
            query=q_nope,
            key=k_nope,
            value=k_nope,
            sparse_indices=topk_indices,
            scale_value=self.scale,
            sparse_block_size=1,
            block_table=prefill_metadata.block_table,
            actual_seq_lengths_query=prefill_metadata.query_lens,
            actual_seq_lengths_kv=prefill_metadata.seq_lens,
            query_rope=q_pe,
            key_rope=k_rope,
            layout_query="TND",
            layout_kv="PA_BSND",
            sparse_mode=3,
        )

    elif attn_metadata.decode is not None:

critical

The logic to distinguish between prefill and decode paths is incorrect. In a mixed batch with both prefill and decode requests, attn_metadata.prefill will be non-None. This will cause the decode_attn_output computation to incorrectly enter the prefill branch, leading to wrong attention results. The elif for the decode path will never be reached in a mixed batch.

To fix this, you should add a boolean flag, e.g., is_prefill, to the apply_attention_fusion function signature. Then, use this flag to select the correct code path.

def apply_attention_fusion(self, query_states, key_states, topk_indices,
                           attn_metadata: M, is_prefill: bool):
    # ...
    if is_prefill:
        prefill_metadata = attn_metadata.prefill
        assert prefill_metadata is not None
        # ... prefill logic
    else: # is_decode
        decode_metadata = attn_metadata.decode
        assert decode_metadata is not None
        # ... decode logic

You will also need to update the call sites in the forward method to pass this flag:

  • is_prefill=False for the decode call (line 827).
  • is_prefill=True for the prefill call (line 835).

Comment on lines +918 to +924
def indexer_select(
    self,
    x: torch.Tensor,
    qr: torch.Tensor,
    kv_cache: Tuple[torch.Tensor, torch.Tensor, torch.Tensor],
    attn_metadata: M,
):

critical

This function has the same issue as apply_attention_fusion. The if attn_metadata.prefill is not None: check is not sufficient to distinguish between prefill and decode paths in a mixed batch. This will cause incorrect metadata to be used for decode requests.

Please add an is_prefill flag to the function signature and use it to correctly select the metadata.

def indexer_select(
    self,
    x: torch.Tensor,
    qr: torch.Tensor,
    kv_cache: Tuple[torch.Tensor, torch.Tensor, torch.Tensor],
    attn_metadata: M,
    is_prefill: bool,
):
    if is_prefill:
        prefill_metadata = attn_metadata.prefill
        assert prefill_metadata is not None
        cos = prefill_metadata.cos
        # ...
    else: # is_decode
        decode_metadata = attn_metadata.decode
        assert decode_metadata is not None
        cos = decode_metadata.cos
        # ...

The call sites in _sfa_preprocess should be updated to pass is_prefill=False for the decode path (line 690) and is_prefill=True for the prefill path (line 778).

# Profiling run.
return output

if attn_metadata.prefill is not None:

critical

The condition if attn_metadata.prefill is not None: is incorrect for handling mixed batches. In a batch with both prefill and decode requests, this condition will always be true, and the decode path in the elif block (line 1099) will never be reached. This will cause decode requests to be processed with prefill logic, which is incorrect.

The AscendSFATorchairMetadataBuilder sets is_prefill and is_decode flags. Please use if attn_metadata.is_prefill: here, and elif attn_metadata.is_decode: for the decode block to ensure correct path selection.
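A sketch of the proposed condition, assuming the builder really does populate is_prefill/is_decode as described:

# Sketch only: branch on the explicit flags set by the metadata builder
# rather than on whether the prefill metadata object exists, which is also
# true in mixed batches.
if attn_metadata.is_prefill:
    ...  # prefill attention path
elif attn_metadata.is_decode:
    ...  # decode attention path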

Comment on lines +1218 to +1225
):
    if attn_metadata.prefill is not None:
        cos = attn_metadata.prefill.cos
        sin = attn_metadata.prefill.sin
    elif attn_metadata.decode is not None:
        cos = attn_metadata.decode.cos
        sin = attn_metadata.decode.sin


critical

The is_prefill parameter is passed to this function but it is not used. Instead, the function uses if attn_metadata.prefill is not None:, which is incorrect for mixed batches and will lead to using prefill metadata for decode requests.

Please refactor this function to use the is_prefill flag to correctly select the metadata from either attn_metadata.prefill or attn_metadata.decode.
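A minimal sketch of that refactor, assuming the function keeps its current is_prefill parameter:

# Sketch only: honor the is_prefill argument instead of inferring the path
# from attn_metadata.prefill, which is also set in mixed batches.
if is_prefill:
    assert attn_metadata.prefill is not None
    cos, sin = attn_metadata.prefill.cos, attn_metadata.prefill.sin
else:
    assert attn_metadata.decode is not None
    cos, sin = attn_metadata.decode.cos, attn_metadata.decode.sin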

Comment on lines +26 to +27
sfa_bytes = 128 * self.block_size * get_dtype_size(
    self.dtype) if self.use_sfa else 0

high

The value 128 is hardcoded for calculating sfa_bytes. This seems to correspond to the head_dim of the SFA indexer. Hardcoding this value makes the code less maintainable and harder to adapt for models with different indexer dimensions.

This value should be retrieved from the model configuration and passed to AttentionSpec instead of being hardcoded.
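One way that could look, assuming a new indexer_head_dim field is added to the Ascend attention spec and filled from the model config (the field name is an assumption):

# Sketch only: assume the spec gains an indexer_head_dim field populated from
# the model config; the byte calculation then avoids the literal 128.
indexer_head_dim: int = 128  # hypothetical new field on the spec dataclass

sfa_bytes = (self.indexer_head_dim * self.block_size *
             get_dtype_size(self.dtype)) if self.use_sfa else 0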

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for a new Sparse Flash Attention (SFA) backend, primarily for DeepSeek V2 models on Ascend NPUs. The changes are extensive, adding new attention implementations for both eager mode (sfa_v1.py) and torch.compile/torchair mode (torchair_sfa.py), along with necessary modifications across the model definition, configuration, and distributed components. While the implementation is comprehensive, I've identified several critical and high-severity issues related to correctness and maintainability, including incorrect metadata class usage, type mismatches, hardcoded values, and significant code duplication. These issues should be addressed to ensure the stability and correctness of the new SFA backend. The pull request description is also missing, which would be very helpful for understanding the context and testing of these complex changes.

@wangxiyuan wangxiyuan added the ready (read for review) and ready-for-test (start test by label for PR) labels Sep 29, 2025
@Yikun Yikun changed the title from "Init v0.11.0 related code" to "Add DeepSeek V3.2 support" Sep 29, 2025
zzzzwwjj and others added 15 commits September 29, 2025 21:48
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: zzzzwwjj <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: linfeng-yuan <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: linfeng-yuan <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: wxsIcey <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: linfeng-yuan <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
…or deepseek_r1

Signed-off-by: linfeng-yuan <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: linfeng-yuan <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
wangxiyuan and others added 7 commits September 29, 2025 22:49
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: linfeng-yuan <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: wangxiyuan <[email protected]>
@wangxiyuan wangxiyuan (Collaborator, Author) commented:

Unit tests passed here: #3277

@wangxiyuan wangxiyuan merged commit 81bd6e4 into vllm-project:main Sep 29, 2025
16 checks passed